Speech recognition apparatus and methods.

Speech recognition apparatus has a television
camera or optical array 6 that views the mouth of a
speaker, and a laryngograph 20 that detects movement
of the vocal folds. A microphone 5 produces
an output in respect of the sound produced by the
speaker. After visual processing, signals from the
camera 6 are supplied to a pattern matching unit 63
together with signals from a store 64. Signals from
the laryngograph are also supplied to the pattern
matching unit to resolve ambiguity between speech
sounds with similar mouth shapes. The output of the
microphone is supplied to a noise adaptation unit 52
together with laryngograph output which enables external
noise to be identified and rejected. After noise
adaptation, the microphone output is subjected to
pattern matching with a vocabulary 54 of reference
templates. Signals representing the most likely fit
after the two pattern matching processes are supplied
to a comparator 70 which selects the most
likely word spoken, or signals the speaker, via a
feedback device 82, to repeat the word spoken.
SPEECH RECOGNITION APPARATUS AND METHODS



This invention relates to speech recognition 
apparatus. 

In complex equipment having multiple func- 
tions it can be useful to be able to control the 
equipment by spoken commands. This is also useful
where the user's hands are occupied with other
tasks or where the user is disabled and is unable to
use his hands to operate conventional mechanical
switches and controls.

The problem with equipment controlled by 
speech is that speech recognition can be unrelia- 
ble, especially in noisy environments. This can lead 
to failure to operate or, worse still, to incorrect
operation. 

Speech signal processing can also be used in 
communication systems, where the speech input is 
degraded by noise, to improve the quality of 
speech output. This generally involves filtering and
signal enhancement but usually results in some
loss of the speech information where high noise is
present.

It is the object of the present invention to
provide speech recognition apparatus and methods 
that can be used to improve speech handling. 

According to one aspect of the present inven- 
tion there is provided speech recognition appara- 
tus, characterised in that the apparatus includes an 
optical device mounted to view a part at least of 
the mouth of a speaker, the optical device provid- 
ing an output that varies with movement of the 
speaker's mouth, voicing sensing apparatus that 
detects movement of the speaker's vocal folds 
such as thereby to derive information regarding the 
voiced sounds of the speaker, and a processing 
unit that derives from the output of the optical 
device and the voicing sensing apparatus information
as to the speech sounds made by the speaker.

The voicing sensing apparatus preferably in- 
cludes a laryngograph responsive to changes in 
impedance to electromagnetic radiation during
movement of the vocal folds. The apparatus may
include a store containing information of a reference
vocabulary of visual characteristics of the
speaker's mouth, and a first pattern matching unit 
that selects the closest match between the output 
of the optical device and the vocabulary in the 
store and provides an output representative of the 
selected speech sounds. The output from the voic- 
ing sensing apparatus may be supplied to the first 
pattern matching unit to improve identification of 
the speech sounds. 

The apparatus may include a microphone that 
derives an output in respect of the sound produced 
by the speaker, and a comparator that compares 
the output from the microphone with information
derived from the optical device such as to derive
information concerning speech sounds made by
the speaker. The outputs of the voicing sensing
apparatus and the microphone are preferably combined
in order to identify sound originating from the
speaker and sound originating from external sources.

The apparatus preferably includes a store containing
a reference vocabulary of sound signal information,
a second pattern matching unit connected
to the store and connected to receive the
output of the microphone after rejecting signals
associated with sounds originating from external
sources. The comparator preferably receives the
outputs of the first and second pattern matching
units and provides an output representing the most
probable speech sounds made in accordance
therewith.

The apparatus may include a circuit that modifies
the output of the microphone. The output of
the microphone may be modified by the output of the
first pattern matching unit, the output of the microphone
being supplied to the second pattern matching
unit after modification by the first pattern
matching unit.

Speech recognition apparatus and its method 
of operation, in accordance with the present inven- 
tion, will now be described, by way of example, 
with reference to the accompanying drawings, in
which:

Figure 1 is a side elevation view of a user
wearing a breathing mask;

Figure 2 is a front elevation view of the
mouth of the user;

Figure 3 illustrates schematically the apparatus;
and

Figure 4 illustrates schematically alternative
apparatus.

With reference first to Figures 1 and 2, there is
shown a speaker wearing a breathing mask 1 having
an air supply line 2 that opens into the mask on
the side. An exhaust valve, not shown, is provided
on the other side in the conventional way. The
mask 1 also supports a microphone 5 that is located
to detect speech by the user and to supply
electrical output signals on line 50, in accordance
with the speech and other sounds within the mask,
to a speech recognition unit 10.

Also mounted in the mask 1 is a small, lightweight
CCD television camera 6 which is directed
to view the region of the user's mouth including an
area immediately around the mouth, represented
by the pattern in Figure 2. Preferably, the camera 6
is responsive to infra-red radiation so that it does not
require additional illumination. Alternatively, the
camera 6 may be responsive to visible or ultra-violet
radiation if suitable illumination is provided.
Signals from the camera 6 are supplied via line 60
to the speech recognition unit 10.

The speech recognition apparatus also includes
a laryngograph 20 of conventional construction
such as described in ASHA Reports 11, 1981,
p 116-127. The laryngograph 20 includes two electrodes
21 and 22 secured to the skin of the user's
throat by means of a neck band 23. The electrodes
21 and 22 are located on opposite sides of the
neck, level with the thyroid cartilage. Each electrode
21 and 22 is flat and circular in shape, being
between 15 and 30mm in diameter, with a central
circular plate and a surrounding annular guard ring
insulated from the central plate. One electrode 21
is connected via a coaxial cable 24 to a supply unit
25 which applies a 4MHz transmitting voltage between
the central plate and guard ring of the electrode.
Typically, about 30mW is dissipated at the
surface of the user's neck. The other electrode 22
serves as a current pick-up. Current flow through
the user's neck will vary according to movement of
the user's vocal folds. More particularly, current
flow increases (that is, impedance decreases)
when the area of contact between the vocal folds
increases, although movement of the vocal folds
which does not vary the area of contact will not
necessarily produce any change in current flow.

The output from the second electrode 22 is 
supplied on line 26 to a processing unit 27. The
output signal is modulated according to the frequency
of excitation of the vocal tract and thereby
provides information about phonation, or voiced
speech, of the user. This signal is unaffected by
external noise and by movement of the user's 
mouth and tongue. The processing unit 27 provides 
an output signal on line 28 in accordance with the 
occurrence and frequency of voiced speech, this 
signal being in a form that can be handled by the 
speech recognition unit 10. 

With reference now also to Figure 3, signals
from the microphone 5 are first supplied to a
spectral analysis unit 51 which produces output
signals in accordance with the frequency bands
within which the sound falls. These signals are
supplied to a spectral correction and noise adaptation
unit 52 which improves the signal to noise ratio
or eliminates, or marks, those speech signals that
have been corrupted by noise. The spectral correction
unit 52 also receives input signals from the
laryngograph 20 on line 28. These signals are used
to improve the identification of speech sounds. For
example, if the microphone 5 receives signals
which may have arisen from voiced speech (that is,
speech with sound produced by vibration of the
vocal folds) or from external noise which produces
sounds similar to the phonemes |z| in 'zero' or |i| in
'hid', but there is no output from the laryngograph
20, then the sound can only have arisen either
from noise or from another class of sound corrupted
by noise and is consequently marked as
such. Output signals from the unit 52 are supplied
to one input of a pattern matching unit 53. The
other input to the pattern matching unit 53 is taken
from a store 54 containing information of a reference
vocabulary of sound signal information in the
form of pattern templates or word models of the
frequency/time patterns or state descriptions of different
words. The pattern matching unit 53 compares
the frequency/time patterns derived from the
microphone 5 with the stored vocabulary and produces
an output on line 55 in accordance with the
word which is the most likely fit for the sound
received by the microphone. The output may include
information as to the probability that the word
selected from the vocabulary is the actual word
spoken. The output may also include signals representing
a plurality of the most likely words actually
spoken together with their associated probabilities.
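
By way of illustration only, the marking step carried out by the unit 52
might be sketched as follows. This is a minimal sketch, not the circuit
of the specification: it assumes the microphone and laryngograph signals
have already been reduced to one true/false voicing decision per frame,
and the function name and label strings are hypothetical.

```python
def mark_frames(acoustic_voiced, laryngograph_voiced):
    """Label each frame by comparing the two voicing decisions.

    acoustic_voiced[i]     -- True if the microphone spectrum of frame i
                              looks like a voiced sound (assumed input).
    laryngograph_voiced[i] -- True if the laryngograph reports vocal fold
                              vibration during frame i (assumed input).
    """
    labels = []
    for looks_voiced, folds_vibrating in zip(acoustic_voiced, laryngograph_voiced):
        if looks_voiced and not folds_vibrating:
            # Sounds voiced but the vocal folds are still: noise, or
            # another class of sound corrupted by noise.
            labels.append("noise_marked")
        elif looks_voiced:
            labels.append("voiced")
        else:
            labels.append("unvoiced")
    return labels
```

A later stage could then skip, or give less weight to, the frames
labelled "noise_marked" when matching against the templates in store 54.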

The part of the unit 10 which processes the
optical information from the camera 6 includes a
visual processing unit 61 which receives the camera
outputs. The visual processing unit 61 analyses
the input signals to identify key characteristics of
the optical speech patterns or optical model states
from the visual field of the camera, such as, for
example, lip and teeth separation and lip shape. In
this respect, well-known optical recognition techniques
can be used.

The visual processing unit 61 supplies output
signals on line 62 to one input of a pattern matching
unit 63. A second input to the pattern matching
unit 63 is taken from a store 64 containing information
of a reference vocabulary in the form of templates
of the key visual characteristics of the
mouth. Signals from the laryngograph 20 on line 28
are supplied to a third input of the pattern matching
unit 63 to improve identification of the word spoken.
The output of the laryngograph 20 is used to
resolve situations where there is ambiguity of the
sound produced from observation of mouth movement
alone. For example, the sounds |s| and |z| will
produce the same output from the visual processing
unit 61, but only the sound |z| will produce an
output from the laryngograph 20. This thereby
enables the pattern matching unit 63 to select the
correct sound of the two alternatives. Similarly,
some sounds which have the same mouth shape
can be identified in the pattern matching unit 63
because they are produced with voicing
(phonation) of different frequencies. The pattern
matching unit 63 provides an output on line 65 in
accordance with the word that best fits the observed
mouth movement. The output may also
include information as to the probability that the
word selected from the vocabulary is the actual
word spoken. The output may also include signals
representing a plurality of the most likely words
actually spoken with their associated probabilities.
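
The use of the voicing signal to separate |s| from |z| can be sketched
as follows. This is an illustrative sketch only: the table of mouth
shapes, the label "teeth_close_lips_spread" and the function name are
hypothetical and are not taken from the store 64 described above.

```python
# Hypothetical visual vocabulary: each mouth-shape label maps to the
# sounds that share that shape, with a flag saying whether each sound
# is voiced (vocal folds vibrating).
MOUTH_SHAPE_CANDIDATES = {
    "teeth_close_lips_spread": [("s", False), ("z", True)],
}


def resolve_sound(mouth_shape, voicing_detected):
    """Return the candidate sounds consistent with the voicing signal."""
    candidates = MOUTH_SHAPE_CANDIDATES.get(mouth_shape, [])
    return [sound for sound, voiced in candidates if voiced == voicing_detected]
```

With this table, a laryngograph output selects |z| and the absence of
one selects |s|; an unknown mouth shape yields an empty list.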

The outputs from the two pattern matching
units 53 and 63 are supplied to a comparison
unit 70 which may function in various ways. If both
inputs to the comparison unit 70 indicate the same
word then the comparison unit produces output
signals representing that word on line 71 to a
control unit 80 or other utilisation means. If the
inputs to the comparison unit 70 indicate different
words, the unit responds by selecting the word with
the highest associated probability. Where the pattern
matching units 53 and 63 produce outputs in
respect of a plurality of the most likely words
spoken, the comparison unit acts to select the word
with the highest total probability. The comparison
unit 70 may be arranged to give signals from one
or other of the pattern matching units 53 or 63 a
higher weighting than the other when selecting
between conflicting inputs.
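
One possible reading of the comparison unit 70 is sketched below. The
specification does not say how the "total probability" is formed; this
sketch assumes, purely for illustration, that it is a weighted sum of
the probabilities reported by the two pattern matching units. The
function name, parameter names and default weights are hypothetical.

```python
def select_word(acoustic, visual, acoustic_weight=1.0, visual_weight=1.0):
    """Pick the word with the highest weighted total score.

    acoustic, visual -- dicts mapping candidate words to the
    probabilities reported by units 53 and 63 respectively.
    Returns (word, total) or None if neither unit offers a candidate.
    """
    totals = {}
    for word, probability in acoustic.items():
        totals[word] = totals.get(word, 0.0) + acoustic_weight * probability
    for word, probability in visual.items():
        totals[word] = totals.get(word, 0.0) + visual_weight * probability
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best, totals[best]
```

Raising visual_weight relative to acoustic_weight corresponds to giving
the optical channel the higher weighting, as might be chosen in a noisy
environment.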

If the comparison unit 70 fails to identify a
word with sufficiently high probability, it supplies a
feedback output on line 72 to a feedback device 82 
giving information to the user which, for example, 
prompts him to repeat the word, or asks him to 
verify that a selected word was the word spoken. 
The feedback device 82 may generate an audible 
or visual signal. A third output on line 73 may be
provided to an optional syntax selection unit (not 
shown) which is used in a known way to reduce the 
size of the reference vocabulary for subsequent 
words. 

The output signals on line 71 are supplied to 
the control unit 80 which effects control of the 
selected function in accordance with the words
spoken. 

In operation, the user first establishes the refer- 
ence vocabulary in stores 54 and 64 by speaking a 
list of words. The apparatus then stores information 
derived from the sound, voicing and mouth move- 
ments produced by the spoken list of words for use 
in future comparison. 

A modification of the apparatus of Figure 3 is 
shown in Figure 4. In this modification it will be 
seen that a spectrum substitution unit 74 is inter- 
posed between the spectral correction and noise 
adaptation unit 52 and the pattern matching unit 53. 
This modification operates to replace only short-term
corrupted speech spectra with a 'most-likely'
description of uncorrupted short-term spectra.
When the noise detection process carried out by
unit 52 indicates that the acoustic spectrum output
from the analysis unit 51 has been corrupted by
noise, a clean spectrum most likely to be associated
with the visual pattern detected by the pattern
matching unit 63 is supplied to the input of the
unit 53 via the spectrum substitution unit 74 in place
of the noisy spectrum otherwise supplied by the
unit 52. The spectrum substitution unit 74 transforms
the optical pattern recognised by the pattern
matching unit 63 into an acoustic pattern with the
same structure as the patterns produced at the
outputs of the units 51 and 52.
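
The substitution performed by the unit 74 can be sketched, under stated
assumptions, as a frame-by-frame replacement. The sketch assumes that
each frame carries a spectrum (a list of band energies), a corruption
flag from the unit 52 and a visual pattern label from the unit 63, and
that a table of clean spectra indexed by visual pattern is available.
All of these names and data layouts are hypothetical.

```python
def substitute_spectra(spectra, corrupted, visual_labels, clean_spectra):
    """Replace noise-corrupted frames with a most-likely clean spectrum.

    spectra       -- list of per-frame spectra (lists of band energies)
    corrupted     -- list of booleans, True where a frame is marked noisy
    visual_labels -- list of visual pattern labels, one per frame
    clean_spectra -- dict mapping a visual label to a clean spectrum
    """
    output = []
    for spectrum, is_corrupted, label in zip(spectra, corrupted, visual_labels):
        if is_corrupted and label in clean_spectra:
            # Use the clean spectrum associated with the visual pattern.
            output.append(list(clean_spectra[label]))
        else:
            # Uncorrupted frames, and frames with no known substitute,
            # pass through unchanged.
            output.append(list(spectrum))
    return output
```

Only the frames marked as corrupted are altered, so uncorrupted speech
reaches the pattern matching unit 53 as before.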

In the present invention, although noise may
severely degrade the quality of the sound signal,
making acoustic recognition of the spoken words
impossible, the optical output derived from the
camera 6 and the output of the laryngograph 20
will not be affected and this can be used to make a
positive recognition. The invention is therefore particularly
useful in noisy environments such as factories,
vehicles, quarries, underwater, commodity or
financial dealing markets and so on.

In some circumstances it may not be necessary
to use a microphone since the optical signal
and laryngograph outputs may be sufficient to
identify the words spoken.

Various alternative optical means could be used
to view the user's mouth. In one example, the end
of a fibre-optic cable may be located in the breathing
mask and a television camera mounted remotely
at the other end of the cable. Alternatively,
an array of radiation detectors may be mounted in
the breathing mask or remotely via fibre-optic cables
to derive signals in accordance with the position
and movement of the user's mouth.

Instead of using a laryngograph which detects
movement of the vocal folds by change in impedance
to transmission of high frequency electromagnetic
radiation, various alternative voicing sensing
means could be used. For example, it may be
possible to sense voicing by ultrasound techniques.

Where the user does not wear a breathing
mask, the optical device can be mounted with his
head by other means, such as in a helmet, or on
the microphone boom of a headset. It is not essential
for the optical device to be mounted with the
user's head although this does make it easier to
view the mouth since the optical field will be independent
of head movement. Where the optical
device is not mounted on the user's head, additional
signal processing will be required to identify
the location of the user's mouth. The laryngograph
electrodes could be mounted on an extended collar
of the user's helmet.

It will be appreciated that the blocks shown in
Figure 3 are only schematic and that the functions
carried out by the blocks illustrated could be carried
out by suitable programming of a single computer.

Claims

1. Speech recognition apparatus, characterised
in that the apparatus includes an optical device (6) 
mounted to view a part at least of the mouth of a 
speaker, the optical device providing an output that 
varies with movement of the speaker's mouth, voic- 
ing sensing apparatus (20) that detects movement 
of the speaker's vocal folds such as thereby to 
derive information regarding the voiced sounds of 
the speaker, and a processing unit (10) that derives 
from the output of the optical device (6) and the 
voicing sensing apparatus (20) information as to 
the speech sounds made by the speaker. 

2. Speech recognition apparatus according to 
Claim 1, characterised in that the voicing sensing 
apparatus (20) includes a laryngograph responsive 
to changes in impedance to electromagnetic radi- 
ation during movement of the vocal folds. 

3. Speech recognition apparatus according to 
Claim 1 or 2, characterised in that the apparatus 
includes a store (64) containing information of a 
reference vocabulary of visual characteristics of the 
speaker's mouth, and a first pattern matching unit 
(63) that selects the closest match between the 
output of the optical device (6) and the vocabulary
in the store (64) and provides an output representa- 
tive of the selected speech sounds. 

4. Speech recognition apparatus according to 
Claim 3, characterised in that the output from the 
voicing sensing apparatus (20) is supplied to the 
first pattern matching unit (63) to improve iden- 
tification of the speech sounds. 

5. Speech recognition apparatus according to
any one of the preceding claims, characterised in
that the apparatus includes a microphone (5) that
derives an output in respect of the sound produced
by the speaker, and a comparator (70) that compares
the output from the microphone with information
derived from the optical device (6) such as to
derive information concerning speech sounds
made by the user.

6. Speech recognition apparatus according to 
Claim 5, characterised in that the output of the 
voicing sensing apparatus (20) and the microphone 
(5) are combined in order to identify sound origi- 
nating from the speaker and sound originating from 
external sources.

7. Speech recognition apparatus according to 
Claim 6, characterised in that the apparatus in- 
cludes a store (51, 54) containing a reference vo- 
cabulary of sound signal information, a second 
pattern matching unit (53) connected to the store 
(51, 54) and connected to receive the output of the 
microphone (5) after rejecting signals associated 
with sounds originating from external sources.



8. Speech recognition apparatus according to 
Claim 3 or 4 and Claim 7, characterised in that the 
comparator (70) receives the outputs of the first 
and second pattern matching units (63 and 53) and
provides an output representing the most probable
speech sounds made in accordance therewith. 

9. Speech recognition apparatus according to 
any one of Claims 5 to 8, characterised in that the 
apparatus includes a circuit (63, 74) that modifies
the output of the microphone (5).

10. Speech recognition apparatus according to 
Claims 7 and 9, characterised in that the output of 
the microphone (5) is modified by the output of the 
first pattern matching unit (63) and that the output
of the microphone (5) is supplied to the second
pattern matching unit (53) after modification by the 
first pattern matching unit (63). 